Introduction#

Welcome to the “M5 Forecasting - Accuracy” competition! In this competition, contestants are challenged to forecast future sales at Walmart based on heirarchical sales in the states of California, Texas, and Wisconsin. Forecasting sales, revenue, and stock prices is a classic application of machine learning in economics, and it is important because it allows investors to make guided decisions based on forecasts made by algorithms.

In this kernel, I will briefly explain the structure of dataset. Then, I will visualize the dataset using Matplotlib and Plotly. And finally, I will demonstrate how this problem can be approached with a variety of forecasting algorithms.

Acknowledgements#

  1. M5 Forecasting - Starter Data Exploration ~ by Rob Mulla

  2. EDA and Baseline Model ~ by RDizzl3

  3. How to Create an ARIMA Model for Time Series Forecasting in Python ~ by Machine Learning Mastery

  4. 7 methods to perform Time Series forecasting (with Python codes) ~ by Analytics Vidhya

  5. Economics for the IB Diploma ~ by Ellie Tragakes

  6. Prophet ~ by Facebook

Contents#

  • The dataset

  • EDA

    • Preparing the ground

    • Sales data

    • Denoising

    • Stores and sales

  • Modeling

    • Train/Val split

    • Naive approach

    • Moving average

    • Holt linear

    • Exponential smoothing

    • ARIMA

    • Prophet

    • Loss for each model

  • Takeaways

  • Ending Note

The dataset #

The dataset consists of five .csv files.

  • calendar.csv - Contains the dates on which products are sold. The dates are in a yyyy/dd/mm format.

  • sales_train_validation.csv - Contains the historical daily unit sales data per product and store [d_1 - d_1913].

  • submission.csv - Demonstrates the correct format for submission to the competition.

  • sell_prices.csv - Contains information about the price of the products sold per store and date.

  • sales_train_evaluation.csv - Available one month before the competition deadline. It will include sales for [d_1 - d_1941].

In this competition, we need to forecast the sales for [d_1942 - d_1969]. These rows form the evaluation set. The rows [d_1914 - d_1941] form the validation set, and the remaining rows form the training set. Now, since we understand the dataset and know what to predict, let us visualize the dataset.

EDA #

Now, I will try to visualize the sales data and gain some insights from it.

Preparing the ground #

Import libraries#

import os
import gc
import time
import math
import datetime
from math import log, floor
from sklearn.neighbors import KDTree

import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.utils import shuffle
from tqdm.notebook import tqdm as tqdm

import seaborn as sns
from matplotlib import colors
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

import pywt
from statsmodels.robust import mad

import scipy
import statsmodels
from scipy import signal
import statsmodels.api as sm
from prophet import Prophet
from scipy.signal import butter, deconvolve
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt

import warnings
warnings.filterwarnings("ignore")

Load the data#

SALES_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sales_train_evaluation.csv'
CALENDAR_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/calendar.csv'
PRICES_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sell_prices.csv'
sales_train_val = pd.read_csv('/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sales_train_evaluation.csv')
sales_train_val.name = 'sales'
calendar = pd.read_csv('/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/calendar.csv')
calendar.name = 'calendar'
selling_prices = pd.read_csv('/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sell_prices.csv')
selling_prices.name = 'prices'

Sales data #

Sample sales data#

ids = sorted(list(set(sales_train_val['id'])))
d_cols = [c for c in sales_train_val.columns if 'd_' in c]
x_1 = sales_train_val.loc[sales_train_val['id'] == ids[2]].set_index('id')[d_cols].values[0]
x_2 = sales_train_val.loc[sales_train_val['id'] == ids[1]].set_index('id')[d_cols].values[0]
x_3 = sales_train_val.loc[sales_train_val['id'] == ids[17]].set_index('id')[d_cols].values[0]

fig = make_subplots(rows=3, cols=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_1)), y=x_1, showlegend=False,
                    mode='lines', name="First sample",
                         marker=dict(color="mediumseagreen")),
             row=1, col=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_2)), y=x_2, showlegend=False,
                    mode='lines', name="Second sample",
                         marker=dict(color="violet")),
             row=2, col=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_3)), y=x_3, showlegend=False,
                    mode='lines', name="Third sample",
                         marker=dict(color="dodgerblue")),
             row=3, col=1)

fig.update_layout(height=1200, width=800, title_text="Sample sales")
fig.show()

These are sales data from randomly selected stores in the dataset. As expected, the sales data is very erratic, owing to the fact that so many factors affect the sales on a given day. On certain days, the sales quantity is zero, which indicates that a certain product may not be available on that day (as noted by Rob in his kernel).

Sample sales snippets#

ids = sorted(list(set(sales_train_val['id'])))
d_cols = [c for c in sales_train_val.columns if 'd_' in c]
x_1 = sales_train_val.loc[sales_train_val['id'] == ids[0]].set_index('id')[d_cols].values[0][:90]
x_2 = sales_train_val.loc[sales_train_val['id'] == ids[4]].set_index('id')[d_cols].values[0][1300:1400]
x_3 = sales_train_val.loc[sales_train_val['id'] == ids[65]].set_index('id')[d_cols].values[0][350:450]
fig = make_subplots(rows=3, cols=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_1)), y=x_1, showlegend=False,
                    mode='lines+markers', name="First sample",
                         marker=dict(color="mediumseagreen")),
             row=1, col=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_2)), y=x_2, showlegend=False,
                    mode='lines+markers', name="Second sample",
                         marker=dict(color="violet")),
             row=2, col=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_3)), y=x_3, showlegend=False,
                    mode='lines+markers', name="Third sample",
                         marker=dict(color="dodgerblue")),
             row=3, col=1)

fig.update_layout(height=1200, width=800, title_text="Sample sales snippets")
fig.show()

In the above plots, I simply zoom in to sample snippets in the sales data. As stated earlier, we can clearly see that the sales data is very erratic and volatile. Sometimes, the sales are zero for a few days in a row, and at other times, it remains at its peak value for a few days. Therefore, we need some sort of “denoising” techniques to find the underlying trends in the sales data and make forecasts.

Denoising #

Now, I will show how these volatile sales prices can be denoised in order to extract underlying trends. This method may lose some information from the original time series, but it may be useful in extracting certain features regarding the trends in the time series.

Wavelet denoising#

Wavelet denoising (usually used with electric signals) is a way to remove the unnecessary noise from a time series. This method calculates coefficients called the “wavelet coefficients”. These coefficients decide which pieces of information to keep (signal) and which ones to discard (noise).

We make use of the MAD (mean absolute deviation) value to understand the randomness in the sales and accordingly decide the minimum threshold for the wavelet coefficients in the time series. We filter out the low coefficients from the wavelets and reconstruct the sales data from the remaining coefficients and that’s it; we have successfully removed noise from the sales data.

def maddest(d, axis=None):
    return np.mean(np.absolute(d - np.mean(d, axis)), axis)

def denoise_signal(x, wavelet='db4', level=1):
    coeff = pywt.wavedec(x, wavelet, mode="per")
    sigma = (1/0.6745) * maddest(coeff[-level])

    uthresh = sigma * np.sqrt(2*np.log(len(x)))
    coeff[1:] = (pywt.threshold(i, value=uthresh, mode='hard') for i in coeff[1:])

    return pywt.waverec(coeff, wavelet, mode='per')
y_w1 = denoise_signal(x_1)
y_w2 = denoise_signal(x_2)
y_w3 = denoise_signal(x_3)


fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_1)), mode='lines+markers', y=x_1, marker=dict(color="mediumaquamarine"), showlegend=False,
               name="Original signal"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_1)), y=y_w1, mode='lines', marker=dict(color="darkgreen"), showlegend=False,
               name="Denoised signal"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_2)), mode='lines+markers', y=x_2, marker=dict(color="thistle"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_2)), y=y_w2, mode='lines', marker=dict(color="purple"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_3)), mode='lines+markers', y=x_3, marker=dict(color="lightskyblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_3)), y=y_w3, mode='lines', marker=dict(color="navy"), showlegend=False),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Original (pale) vs. Denoised (dark) sales")
fig.show()

In the above graphs, the dark lineplots represent the denoised sales and the light lineplots represent the original sales. We can see that Wavelet denoising is able to successfully find the “general trend” in the sales data without getting distracted by the noise. Finding these high- trends or patterns in the sales may be useful in generating features to train a model.

The below diagram illustrates these graphs side-by-side. Red graphs represent original sales and green graphs represent denoised sales.

fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(30, 20))

ax[0, 0].plot(x_1, color='seagreen', marker='o') 
ax[0, 0].set_title('Original Sales', fontsize=24)
ax[0, 1].plot(y_w1, color='red', marker='.') 
ax[0, 1].set_title('After Wavelet Denoising', fontsize=24)

ax[1, 0].plot(x_2, color='seagreen', marker='o') 
ax[1, 0].set_title('Original Sales', fontsize=24)
ax[1, 1].plot(y_w2, color='red', marker='.') 
ax[1, 1].set_title('After Wavelet Denoising', fontsize=24)

ax[2, 0].plot(x_3, color='seagreen', marker='o') 
ax[2, 0].set_title('Original Sales', fontsize=24)
ax[2, 1].plot(y_w3, color='red', marker='.') 
ax[2, 1].set_title('After Wavelet Denoising', fontsize=24)

plt.show()
../_images/d7280c6e1fcfbfd3640d009eb491909c1b5a7a2fee232f8605d10dd33e9b9721.png

Average smoothing#

Average smooting is a relatively simple way to denoise time series data. In this method, we take a “window” with a fixed size (like 10). We first place the window at the beginning of the time series (first ten elements) and calculate the mean of that section. We now move the window across the time series in the forward direction by a particular “stride”, calculate the mean of the new window and repeat the process, until we reach the end of the time series. All the mean values we calculated are then concatenated into a new time series, which forms the denoised sales data.

def average_smoothing(signal, kernel_size=3, stride=1):
    sample = []
    start = 0
    end = kernel_size
    while end <= len(signal):
        start = start + stride
        end = end + stride
        sample.extend(np.ones(end - start)*np.mean(signal[start:end]))
    return np.array(sample)
y_a1 = average_smoothing(x_1)
y_a2 = average_smoothing(x_2)
y_a3 = average_smoothing(x_3)

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_1)), mode='lines+markers', y=x_1, marker=dict(color="lightskyblue"), showlegend=False,
               name="Original sales"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_1)), y=y_a1, mode='lines', marker=dict(color="navy"), showlegend=False,
               name="Denoised sales"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_2)), mode='lines+markers', y=x_2, marker=dict(color="thistle"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_2)), y=y_a2, mode='lines', marker=dict(color="indigo"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_3)), mode='lines+markers', y=x_3, marker=dict(color="mediumaquamarine"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(len(x_3)), y=y_a3, mode='lines', marker=dict(color="darkgreen"), showlegend=False),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Original (pale) vs. Denoised (dark) signals")
fig.show()

In the above graphs, the dark lineplots represent the denoised sales and the light lineplots represent the original sales. We can see that average smoothing is not as effective as Wavelet denoising at finding macroscopic trends and pattersns in the data. A lot of the noise in the original sales persists even after denoising. Therefore, wavelet denoising is clearly more effective at finding trends in the sales data. Nonetheless, average smoothing or “rolling mean” can also be used to calculate useful features for modeling.

The below diagram illustrates these graphs side-by-side. Red graphs represent original sales and green graphs represent denoised sales.

fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(30, 20))

ax[0, 0].plot(x_1, color='seagreen', marker='o') 
ax[0, 0].set_title('Original Sales', fontsize=24)
ax[0, 1].plot(y_a1, color='red', marker='.') 
ax[0, 1].set_title('After Wavelet Denoising', fontsize=24)

ax[1, 0].plot(x_2, color='seagreen', marker='o') 
ax[1, 0].set_title('Original Sales', fontsize=24)
ax[1, 1].plot(y_a2, color='red', marker='.') 
ax[1, 1].set_title('After Wavelet Denoising', fontsize=24)

ax[2, 0].plot(x_3, color='seagreen', marker='o') 
ax[2, 0].set_title('Original Sales', fontsize=24)
ax[2, 1].plot(y_a3, color='red', marker='.') 
ax[2, 1].set_title('After Wavelet Denoising', fontsize=24)

plt.show()
../_images/dd275f28737ae23acb9573333fa5d8f4cb70263db6d84b98679c42f6aa220547.png

Stores and states #

Now, I will look at the sales data across different stores and states in order to gain some useful insights.

Rolling Average Price vs. Time for each store#

past_sales = sales_train_val.set_index('id')[d_cols] \
    .T \
    .merge(calendar.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date')

store_list = selling_prices['store_id'].unique()
means = []
fig = go.Figure()
for s in store_list:
    store_items = [c for c in past_sales.columns if s in c]
    data = past_sales[store_items].sum(axis=1).rolling(90).mean()
    means.append(np.mean(past_sales[store_items].sum(axis=1)))
    fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (per store)")

In the above graph, I have plotted rolling sales across all stores in the dataset. Almost every sales curve has “linear oscillation” trend at the macroscopic level. Basically, the sales oscillate like a sine wave about a certain mean value, but this mean value has an upward linear trend. This implies that the sales are oscillating at a higher and higher level every few months.

This trend is reminiscent of the business cycle, where economies have short-term oscillatory fluctuations but grow linearly in the long run. Maybe, such small-scale trends at the level of stores add up to decide trends we see at the macroeconomic level. Below is an illustration of the macroeconomic business cycle:

fig = go.Figure()

for i, s in enumerate(store_list):
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        fig.add_trace(go.Box(x=[s]*len(data), y=data, name=s))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Store name ")

The above plot compares the sales distribution for each store in the dataset. The stores in California seem to have the highest variance in sales, which might indicate that some places in California grow significantly faster than others, i.e. there is development disparity. On the other hand, the Wisconsin and Texas sales seem to be quite consistent among themselves, without much variance. This indicates that development might be more uniform in these states. Moreover, the California stores also seem to have the highest overall mean sales.

df = pd.DataFrame(np.transpose([means, store_list]))
df.columns = ["Mean sales", "Store name"]
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Mean sales vs. Store name")

From the above graph, we can see the same trends: California stores have the highest variance and mean sales among all the stores in the dataset.

Rolling Average Price vs. Time (CA)#

greens = ["mediumaquamarine", "mediumseagreen", "seagreen", "green"]
store_list = selling_prices['store_id'].unique()
fig = go.Figure()
means = []
stores = []
for i, s in enumerate(store_list):
    if "ca" in s or "CA" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        means.append(np.mean(past_sales[store_items].sum(axis=1)))
        stores.append(s)
        fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s, marker=dict(color=greens[i])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (California)")

In the above graph, we can see the large disparity in sales among California stores. The sales curves almost never intersect each other. This may indicate that there are certain “hubs” of development in California which do not change over time. And other areas always remain behind these “hubs”. The average sales in descending order are CA_3, CA_1, CA_2, CA_4. The store CA_3 has maximum sales while the store CA_4 has minimum sales.

fig = go.Figure()

for i, s in enumerate(store_list):
    if "ca" in s or "CA" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        fig.add_trace(go.Box(x=[s]*len(data), y=data, name=s, marker=dict(color=greens[i])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Store name (California)")
df = pd.DataFrame(np.transpose([means, stores]))
df.columns = ["Mean sales", "Store name"]
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Mean sales vs. Store name", color_continuous_scale=greens)


fig = go.Figure(data=[
    go.Bar(name='', x=stores, y=means, marker={'color' : greens})])

fig.update_layout(title="Mean sales vs. Store name (California)", yaxis=dict(title="Mean sales"), xaxis=dict(title="Store name"))
fig.update_layout(barmode='group')
fig.show()

In the above plots, we can see the same relationship. The store CA_3 has maximum sales while the store CA_4 has minimum sales.

Rolling Average Price vs. Time (WI)#

purples = ["thistle", "violet", "purple", "indigo"]
store_list = selling_prices['store_id'].unique()
fig = go.Figure()
means = []
stores = []
for i, s in enumerate(store_list):
    if "wi" in s or "WI" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        means.append(np.mean(past_sales[store_items].sum(axis=1)))
        stores.append(s)
        fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s, marker=dict(color=purples[i%len(purples)])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (Wisconsin)")

In the above graph, we can see a very low disparity in sales among Wisconsin stores. The sales curves intersect each other very often. This may indicate that most parts of Wisconsin have a similar “development curve” and that there is a greater equity in development across the state. There are no specific “hotspots” or “hubs” of development. The average sales in descending order are WI_2, WI_3, WI_1. The store WI_2 has maximum sales while the store WI_1 has minimum sales.

fig = go.Figure()

for i, s in enumerate(store_list):
    if "wi" in s or "WI" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        fig.add_trace(go.Box(x=[s]*len(data), y=data, name=s, marker=dict(color=purples[i%len(purples)])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Store name (Wisconsin)")
df = pd.DataFrame(np.transpose([means, stores]))
df.columns = ["Mean sales", "Store name"]
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Mean sales vs. Store name", color_continuous_scale=greens)


fig = go.Figure(data=[
    go.Bar(name='', x=stores, y=means, marker={'color' : purples})])

fig.update_layout(title="Mean sales vs. Store name (Wisconsin)", yaxis=dict(title="Mean sales"), xaxis=dict(title="Store name"))
fig.update_layout(barmode='group')
fig.show()

In the above plots, we can see the same relationship. The store W1_2 has maximum sales while the store W1_1 has minimum sales.

Rolling Average Price vs. Time (TX)#

blues = ["skyblue", "dodgerblue", "darkblue"]
store_list = selling_prices['store_id'].unique()
fig = go.Figure()
means = []
stores = []
for i, s in enumerate(store_list):
    if "tx" in s or "TX" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        means.append(np.mean(past_sales[store_items].sum(axis=1)))
        stores.append(s)
        fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s, marker=dict(color=blues[i%len(blues)])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (Texas)")

In the above graph, we can once again see that a very low disparity in sales among Texas stores. The sales curves intersect each other often, albeit not as often as in Wisconsin. This might once again indicate that most parts of Texas have a similar “development curve” and that there is a greater equity in development across the state. The variance here is higher than in Wisconsin though, so there might be “hubs” of development in Texas as well, but not as pronounced as in California. The average sales in descending order are TX_2, TX_3, TX_1. The store TX_2 has maximum sales while the store TX_1 has minimum sales.

fig = go.Figure()

for i, s in enumerate(store_list):
    if "tx" in s or "TX" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        fig.add_trace(go.Box(x=[s]*len(data), y=data, name=s, marker=dict(color=blues[i%len(blues)])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Store name (Texas)")
df = pd.DataFrame(np.transpose([means, stores]))
df.columns = ["Mean sales", "Store name"]
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Mean sales vs. Store name", color_continuous_scale=greens)


fig = go.Figure(data=[
    go.Bar(name='', x=stores, y=means, marker={'color' : blues})])

fig.update_layout(title="Mean sales vs. Store name (Texas)", yaxis=dict(title="Mean sales"), xaxis=dict(title="Store name"))
fig.update_layout(barmode='group')
fig.show()

In the above plots, we can see the same relationship. The store TX_2 has maximum sales while the store TX_1 has minimum sales.

Modeling #

Now, I will demonstrate how sales can be forecasted using various methods, namely: naive approach, moving average, Holt linear, exponential smoothing, ARIMA, and Prophet

Train/Val split #

First, we need to create miniature training and validation sets to train and validate our models. I will use the last 30 days’ sales as the validation data and the sales of the 70 days before that as the training data. We need to predict the sales in the validation data using the sales in the training data.

train_dataset = sales_train_val[d_cols[-100:-30]]
val_dataset = sales_train_val[d_cols[-30:]]

Below are the sales from three sample data points. I will use these samples to demonstrate the working of the models.

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[0].values, marker=dict(color="dodgerblue"), showlegend=False,
               name="Original signal"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[0].values, mode='lines', marker=dict(color="darkorange"), showlegend=False,
               name="Denoised signal"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[1].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[1].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[2].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[2].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Train (blue) vs. Validation (orange) sales")
fig.show()

Naive approach #

The first approach is the very simple naive approach. It simply forecasts the next day’s sales as the current day’s sales. The model can be summarized as follows:

In the above equation, yt+1 is the predicted value for the next day’s sales and yt is today’s sales. The model predicts tomorrow’s sales as today’s sales. Now let us see how this simple model performs on our miniature dataset. The training data is in blue, validation data in orange, and predictions in green.

predictions = []
for i in range(len(val_dataset.columns)):
    if i == 0:
        predictions.append(train_dataset[train_dataset.columns[-1]].values)
    else:
        predictions.append(val_dataset[val_dataset.columns[i-1]].values)
    
predictions = np.transpose(np.array([row.tolist() for row in predictions]))
error_naive = np.linalg.norm(predictions[:3] - val_dataset.values[:3])/len(predictions[0])
pred_1 = predictions[0]
pred_2 = predictions[1]
pred_3 = predictions[2]

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[0].values, marker=dict(color="dodgerblue"),
               name="Train"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[0].values, mode='lines', marker=dict(color="darkorange"),
               name="Val"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_1, mode='lines', marker=dict(color="seagreen"),
               name="Pred"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[1].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[1].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_2, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[2].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[2].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_3, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Naive approach")
fig.show()

We can see that the forecasts made by the naive approach are not accurate and it is to be expected of such a simple algorithm. We need more complex models which use several time stamps to make forecasts.

Moving average #

The moving average method is more complex than the naive approach. It calculates the mean sales over the previous 30 (or any other number) days and forecasts that as the next day’s sales. This method takes the previous 30 timesteps into consideration, and is therefore less prone to short term fluctuations than the naive approach. The model can be summarized as follows:

In the above equation, yt+1 is tomorrow’s sales. On the right hand side, all the sales for the previous 30 days are added up and divided by 30 to find the average. This forms the model’s prediction, yt+1. Now let us see how this new model performs on our miniature dataset. The training data is in blue, validation data in orange, and predictions in green.

predictions = []
for i in range(len(val_dataset.columns)):
    if i == 0:
        predictions.append(np.mean(train_dataset[train_dataset.columns[-30:]].values, axis=1))
    if i < 31 and i > 0:
        predictions.append(0.5 * (np.mean(train_dataset[train_dataset.columns[-30+i:]].values, axis=1) + \
                                  np.mean(predictions[:i], axis=0)))
    if i > 31:
        predictions.append(np.mean([predictions[:i]], axis=1))
    
predictions = np.transpose(np.array([row.tolist() for row in predictions]))
error_avg = np.linalg.norm(predictions[:3] - val_dataset.values[:3])/len(predictions[0])
pred_1 = predictions[0]
pred_2 = predictions[1]
pred_3 = predictions[2]

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[0].values, marker=dict(color="dodgerblue"),
               name="Train"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[0].values, mode='lines', marker=dict(color="darkorange"),
               name="Val"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_1, mode='lines', marker=dict(color="seagreen"),
               name="Pred"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[1].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[1].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_2, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[2].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[2].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_3, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Moving average")
fig.show()

We can see that this model performs better than the naive approach. It is less susceptible to the volatility in day-to-day sales data and manages to pick up trends with slightly higher accuracy. However, it is still unable to find high-level trends in the sales.

Holt linear #

The Holt linear is completely different from the first two methods. Holt linear attempts to capture the high-level trends in time series data using a linear function. The method can be summarized as follows:

Forecast, level, and trend equations respectively#

In the above equations, \(\alpha\) and \(\beta\) are constants which can be configured. The values lt and bt represent the level and trend values repsectively. The trend value is the slope of the linear forecast function and the level value is the y-intercept of the linear forecast function. The slope and y-intercept values are continuously updated using the second and third update equations. Finally, the slope and y-intercept values are used to calculate the forecast, yt+h (in equation 1), which is h time steps ahead of the current time step. Now let us see how this model performs on our miniature dataset. The training data is in blue, validation data in orange, and predictions in green.

predictions = []
for row in tqdm(train_dataset[train_dataset.columns[-30:]].values[:3]):
    fit = Holt(row).fit(smoothing_level = 0.3, smoothing_slope = 0.01)
    predictions.append(fit.forecast(30))
predictions = np.array(predictions).reshape((-1, 30))
error_holt = np.linalg.norm(predictions - val_dataset.values[:len(predictions)])/len(predictions[0])
pred_1 = predictions[0]
pred_2 = predictions[1]
pred_3 = predictions[2]

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[0].values, marker=dict(color="dodgerblue"),
               name="Train"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[0].values, mode='lines', marker=dict(color="darkorange"),
               name="Val"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_1, mode='lines', marker=dict(color="seagreen"),
               name="Pred"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[1].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[1].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_2, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[2].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[2].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_3, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Holt linear")
fig.show()

We can see that Holt linear is able to predict high-level trends in the sales very consistently. But, it is not able to capture the short-term volatility in the sales as accurately as other methods. Maybe this method can be combined with other low-level forecasters to produce better results.

Exponential smoothing #

The exponential smoothing method uses a different type of smoothing which differs from average smoothing. The previous time steps are exponentially weighted and added up to generate the forecast. The weights decay as we move further backwards in time. The model can be summarized as follows:

In the above equations, \(\alpha\) is the smoothing parameter. The forecast yt+1 is a weighted average of all the observations in the series y1, … ,yt. The rate at which the weights decay is controlled by the parameter α. This method gives different weightage to different time steps, instead of giving the same weightage to all time steps (like the moving average method). This ensures that recent sales data is given more importance than old sales data while making the forecast. Now let us see how this new smoothing method performs on our miniature dataset. The training data is in blue, validation data in orange, and predictions in green.

predictions = []
for row in tqdm(train_dataset[train_dataset.columns[-30:]].values[:3]):
    fit = ExponentialSmoothing(row, seasonal_periods=3).fit()
    predictions.append(fit.forecast(30))
predictions = np.array(predictions).reshape((-1, 30))
error_exponential = np.linalg.norm(predictions[:3] - val_dataset.values[:3])/len(predictions[0])
pred_1 = predictions[0]
pred_2 = predictions[1]
pred_3 = predictions[2]

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[0].values, marker=dict(color="dodgerblue"),
               name="Train"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[0].values, mode='lines', marker=dict(color="darkorange"),
               name="Val"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_1, mode='lines', marker=dict(color="seagreen"),
               name="Pred"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[1].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[1].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_2, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[2].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[2].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_3, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Exponential smoothing")
fig.show()

We can see that exponential smoothing is generating a horizontal line every time. This is because it gives very low weightage to faraway time steps, causing the predictions to flatten out or remain constant. However, it is able to predict the mean sales with excellent accuracy.

ARIMA #

ARIMA stands for Auto Regressive Integrated Moving Average. While exponential smoothing models were based on a description of trend and seasonality in data, ARIMA models aim to describe the correlations in the time series.

Now let us see how ARIMA performs on our miniature dataset. The training data is in blue, validation data in orange, and predictions in green.

predictions = []
for row in tqdm(train_dataset[train_dataset.columns[-30:]].values[:3]):
    fit = sm.tsa.statespace.SARIMAX(row, seasonal_order=(0, 1, 1, 7)).fit()
    predictions.append(fit.forecast(30))
predictions = np.array(predictions).reshape((-1, 30))
error_arima = np.linalg.norm(predictions[:3] - val_dataset.values[:3])/len(predictions[0])
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.51391D+00    |proj g|=  3.29224D-01

At iterate    5    f=  1.37351D+00    |proj g|=  1.59405D-02

At iterate   10    f=  1.37241D+00    |proj g|=  2.36062D-03

At iterate   15    f=  1.37232D+00    |proj g|=  7.52749D-04

At iterate   20    f=  1.37230D+00    |proj g|=  1.34413D-03

At iterate   25    f=  1.37230D+00    |proj g|=  4.48013D-04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     28     34      1     0     0   7.643D-06   1.372D+00
  F =   1.3723005599156413     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  3.08182D-01    |proj g|=  4.33657D-01
 This problem is unconstrained.
 This problem is unconstrained.
At iterate    5    f=  2.96080D-01    |proj g|=  9.11210D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3      9     13      1     0     0   1.278D-05   2.960D-01
  F =  0.29600077710409095     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.65736D-01    |proj g|=  1.99376D-01

At iterate    5    f=  8.84023D-01    |proj g|=  1.91855D-02

At iterate   10    f=  8.83171D-01    |proj g|=  8.69850D-03

At iterate   15    f=  8.83067D-01    |proj g|=  4.74831D-03
 This problem is unconstrained.
At iterate   20    f=  8.83051D-01    |proj g|=  8.49232D-04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     23     31      1     0     0   1.570D-04   8.831D-01
  F =  0.88305086938968835     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
pred_1 = predictions[0]
pred_2 = predictions[1]
pred_3 = predictions[2]

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[0].values, marker=dict(color="dodgerblue"),
               name="Train"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[0].values, mode='lines', marker=dict(color="darkorange"),
               name="Val"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_1, mode='lines', marker=dict(color="seagreen"),
               name="Pred"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[1].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[1].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_2, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[2].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[2].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_3, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="ARIMA")
fig.show()

ARIMA is able to find low-level and high-level trends simultaneously, unlike most other models which can only find one of these. It is able to predict a periodic function for each sample, and these functions seem to be pretty accurate (except for the second sample).

Prophet #

Prophet is an opensource time series forecasting project by Facebook. It is based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, including holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. It is also supposed to be more robust to missing data and shifts in trend compared to other models.

Now let us see how Prophet performs on our miniature dataset. The training data is in blue, validation data in orange, and predictions in green.

dates = ["2007-12-" + str(i) for i in range(1, 31)]
predictions = []
for row in tqdm(train_dataset[train_dataset.columns[-30:]].values[:3]):
    df = pd.DataFrame(np.transpose([dates, row]))
    df.columns = ["ds", "y"]
    model = Prophet(daily_seasonality=True)
    model.fit(df)
    future = model.make_future_dataframe(periods=30)
    forecast = model.predict(future)["yhat"].loc[30:].values
    predictions.append(forecast)
predictions = np.array(predictions).reshape((-1, 30))
error_prophet = np.linalg.norm(predictions[:3] - val_dataset.values[:3])/len(predictions[0])
INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 23.
Initial log joint probability = -28.9301
Iteration  1. Log joint probability =    9.18298. Improved by 38.113.
Iteration  2. Log joint probability =    25.4207. Improved by 16.2377.
Iteration  3. Log joint probability =    26.0977. Improved by 0.676986.
Iteration  4. Log joint probability =    26.3816. Improved by 0.283952.
Iteration  5. Log joint probability =    26.3958. Improved by 0.0141368.
Iteration  6. Log joint probability =    26.4112. Improved by 0.0154235.
Iteration  7. Log joint probability =    26.4137. Improved by 0.00253877.
Iteration  8. Log joint probability =    26.4214. Improved by 0.00766101.
Iteration  9. Log joint probability =    26.4299. Improved by 0.00854073.
Iteration 10. Log joint probability =    26.4331. Improved by 0.0031661.
Iteration 11. Log joint probability =    26.4368. Improved by 0.0037265.
Iteration 12. Log joint probability =    26.4384. Improved by 0.00158862.
Iteration 13. Log joint probability =    26.4395. Improved by 0.00107142.
Iteration 14. Log joint probability =    26.4396. Improved by 0.000110604.
Iteration 15. Log joint probability =    26.4405. Improved by 0.000850511.
Iteration 16. Log joint probability =    26.4405. Improved by 9.15856e-05.
Iteration 17. Log joint probability =    26.4407. Improved by 0.00019916.
Iteration 18. Log joint probability =    26.4408. Improved by 0.000100194.
Iteration 19. Log joint probability =    26.4409. Improved by 8.40284e-05.
Iteration 20. Log joint probability =     26.441. Improved by 4.40602e-05.
Iteration 21. Log joint probability =     26.441. Improved by 2.9532e-05.
Iteration 22. Log joint probability =     26.441. Improved by 6.03611e-06.
Iteration 23. Log joint probability =     26.441. Improved by 5.06874e-06.
Iteration 24. Log joint probability =     26.441. Improved by 1.39291e-06.
Iteration 25. Log joint probability =     26.441. Improved by 1.15328e-06.
Iteration 26. Log joint probability =     26.441. Improved by 1.18563e-06.
Iteration 27. Log joint probability =     26.441. Improved by 4.96889e-07.
Iteration 28. Log joint probability =     26.441. Improved by 3.68873e-07.
Iteration 29. Log joint probability =     26.441. Improved by 1.25032e-07.
Iteration 30. Log joint probability =     26.441. Improved by 2.18027e-08.
Iteration 31. Log joint probability =     26.441. Improved by 9.02753e-08.
Iteration 32. Log joint probability =     26.441. Improved by 3.62999e-08.
Iteration 33. Log joint probability =     26.441. Improved by 1.66252e-08.
Iteration 34. Log joint probability =     26.441. Improved by 2.25338e-09.
INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 23.
Initial log joint probability = -30.3946
Iteration  1. Log joint probability =     3.7117. Improved by 34.1063.
Iteration  2. Log joint probability =    10.0115. Improved by 6.29977.
Iteration  3. Log joint probability =    15.8254. Improved by 5.81397.
Iteration  4. Log joint probability =    16.0828. Improved by 0.257403.
Iteration  5. Log joint probability =    16.0901. Improved by 0.00730186.
Iteration  6. Log joint probability =    16.1216. Improved by 0.0314168.
Iteration  7. Log joint probability =    16.1288. Improved by 0.00724291.
Iteration  8. Log joint probability =    16.2066. Improved by 0.0778419.
Iteration  9. Log joint probability =    16.4858. Improved by 0.279201.
Iteration 10. Log joint probability =    16.5475. Improved by 0.0616134.
Iteration 11. Log joint probability =    16.5752. Improved by 0.0276983.
Iteration 12. Log joint probability =       16.6. Improved by 0.0248926.
Iteration 13. Log joint probability =    16.6872. Improved by 0.0871808.
Iteration 14. Log joint probability =    16.7241. Improved by 0.0368782.
Iteration 15. Log joint probability =    16.8667. Improved by 0.142565.
Iteration 16. Log joint probability =    16.8861. Improved by 0.0194048.
Iteration 17. Log joint probability =    16.8969. Improved by 0.0108252.
Iteration 18. Log joint probability =    16.9152. Improved by 0.0183155.
Iteration 19. Log joint probability =    16.9235. Improved by 0.00826583.
Iteration 20. Log joint probability =    16.9287. Improved by 0.00525314.
Iteration 21. Log joint probability =    16.9298. Improved by 0.0010506.
Iteration 22. Log joint probability =    16.9323. Improved by 0.00252011.
Iteration 23. Log joint probability =    16.9325. Improved by 0.000219415.
Iteration 24. Log joint probability =     16.935. Improved by 0.00251261.
Iteration 25. Log joint probability =    16.9365. Improved by 0.00145798.
Iteration 26. Log joint probability =    16.9369. Improved by 0.00037017.
Iteration 27. Log joint probability =    16.9369. Improved by 1.41197e-06.
Iteration 28. Log joint probability =    16.9369. Improved by 8.28238e-05.
Iteration 29. Log joint probability =    16.9371. Improved by 0.000145104.
Iteration 30. Log joint probability =    16.9374. Improved by 0.000263075.
Iteration 31. Log joint probability =    16.9375. Improved by 0.000121901.
Iteration 32. Log joint probability =    16.9375. Improved by 6.74079e-05.
Iteration 33. Log joint probability =    16.9376. Improved by 5.16524e-05.
Iteration 34. Log joint probability =    16.9376. Improved by 3.18956e-05.
Iteration 35. Log joint probability =    16.9377. Improved by 2.51147e-05.
Iteration 36. Log joint probability =    16.9377. Improved by 1.30803e-05.
Iteration 37. Log joint probability =    16.9377. Improved by 1.82392e-05.
Iteration 38. Log joint probability =    16.9377. Improved by 9.59644e-06.
Iteration 39. Log joint probability =    16.9377. Improved by 1.57203e-06.
Iteration 40. Log joint probability =    16.9377. Improved by 2.57797e-06.
Iteration 41. Log joint probability =    16.9377. Improved by 1.78773e-06.
Iteration 42. Log joint probability =    16.9377. Improved by 3.56632e-06.
Iteration 43. Log joint probability =    16.9377. Improved by 8.2788e-07.
Iteration 44. Log joint probability =    16.9377. Improved by 1.1026e-06.
Iteration 45. Log joint probability =    16.9377. Improved by 2.73176e-06.
Iteration 46. Log joint probability =    16.9377. Improved by 5.50919e-07.
Iteration 47. Log joint probability =    16.9377. Improved by 8.07523e-07.
Iteration 48. Log joint probability =    16.9377. Improved by 5.3564e-07.
Iteration 49. Log joint probability =    16.9377. Improved by 1.30242e-07.
Iteration 50. Log joint probability =    16.9377. Improved by 1.43464e-07.
Iteration 51. Log joint probability =    16.9377. Improved by 4.28049e-08.
Iteration 52. Log joint probability =    16.9377. Improved by 1.22815e-07.
Iteration 53. Log joint probability =    16.9377. Improved by 5.07608e-08.
Iteration 54. Log joint probability =    16.9377. Improved by 5.72316e-08.
Iteration 55. Log joint probability =    16.9377. Improved by 3.13832e-08.
Iteration 56. Log joint probability =    16.9377. Improved by 1.06794e-08.
Iteration 57. Log joint probability =    16.9377. Improved by 2.8419e-08.
Iteration 58. Log joint probability =    16.9377. Improved by 9.76502e-09.
INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 23.
Initial log joint probability = -28.3381
Iteration  1. Log joint probability =    13.8058. Improved by 42.1439.
Iteration  2. Log joint probability =    15.5715. Improved by 1.7657.
Iteration  3. Log joint probability =    22.2379. Improved by 6.66635.
Iteration  4. Log joint probability =    22.8565. Improved by 0.618687.
Iteration  5. Log joint probability =    22.9164. Improved by 0.0598917.
Iteration  6. Log joint probability =    23.0844. Improved by 0.167988.
Iteration  7. Log joint probability =    23.1082. Improved by 0.0237693.
Iteration  8. Log joint probability =    23.1302. Improved by 0.0220339.
Iteration  9. Log joint probability =    23.1394. Improved by 0.00919653.
Iteration 10. Log joint probability =    23.1605. Improved by 0.0210343.
Iteration 11. Log joint probability =    23.1609. Improved by 0.000406794.
Iteration 12. Log joint probability =     23.174. Improved by 0.0131328.
Iteration 13. Log joint probability =    23.1771. Improved by 0.00313613.
Iteration 14. Log joint probability =    23.1788. Improved by 0.0017176.
Iteration 15. Log joint probability =    23.1816. Improved by 0.00272397.
Iteration 16. Log joint probability =    23.1827. Improved by 0.00111666.
Iteration 17. Log joint probability =    23.1834. Improved by 0.000737845.
Iteration 18. Log joint probability =    23.1837. Improved by 0.000321163.
Iteration 19. Log joint probability =    23.1838. Improved by 4.31438e-05.
Iteration 20. Log joint probability =     23.184. Improved by 0.000160253.
Iteration 21. Log joint probability =     23.184. Improved by 1.49891e-05.
Iteration 22. Log joint probability =     23.184. Improved by 7.57238e-05.
Iteration 23. Log joint probability =     23.184. Improved by 8.59493e-06.
Iteration 24. Log joint probability =    23.1841. Improved by 5.14383e-05.
Iteration 25. Log joint probability =    23.1841. Improved by 2.52388e-05.
Iteration 26. Log joint probability =    23.1841. Improved by 2.90125e-06.
Iteration 27. Log joint probability =    23.1841. Improved by 1.63269e-06.
Iteration 28. Log joint probability =    23.1841. Improved by 4.82639e-06.
Iteration 29. Log joint probability =    23.1841. Improved by 2.9948e-06.
Iteration 30. Log joint probability =    23.1841. Improved by 1.05341e-06.
Iteration 31. Log joint probability =    23.1841. Improved by 2.21765e-07.
Iteration 32. Log joint probability =    23.1841. Improved by 7.35915e-08.
Iteration 33. Log joint probability =    23.1841. Improved by 6.64259e-07.
Iteration 34. Log joint probability =    23.1841. Improved by 3.2945e-07.
Iteration 35. Log joint probability =    23.1841. Improved by 1.66581e-08.
Iteration 36. Log joint probability =    23.1841. Improved by 2.35331e-08.
Iteration 37. Log joint probability =    23.1841. Improved by 1.98336e-07.
Iteration 38. Log joint probability =    23.1841. Improved by 5.31895e-08.
Iteration 39. Log joint probability =    23.1841. Improved by 7.55865e-09.
pred_1 = predictions[0]
pred_2 = predictions[1]
pred_3 = predictions[2]

fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[0].values, marker=dict(color="dodgerblue"),
               name="Train"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[0].values, mode='lines', marker=dict(color="darkorange"),
               name="Val"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_1, mode='lines', marker=dict(color="seagreen"),
               name="Pred"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[1].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[1].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_2, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70), mode='lines', y=train_dataset.loc[2].values, marker=dict(color="dodgerblue"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=val_dataset.loc[2].values, mode='lines', marker=dict(color="darkorange"), showlegend=False),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=np.arange(70, 100), y=pred_3, mode='lines', marker=dict(color="seagreen"), showlegend=False,
               name="Denoised signal"),
    row=3, col=1
)

fig.update_layout(height=1200, width=800, title_text="Prophet")
fig.show()

Prophet appears to output very similar-shaped predictions to ARIMA. But on closer observation, we can see that the there is a macroscopic upward trend which was absent in ARIMA. In the ARIMA predictions, the exact same pattern was repeated. But in the Prophet predictions, the same pattern is shifted vertically at each oscillation. This shows that is able to capture high-level trends better than ARIMA.

days = range(1, 1913 + 1)
time_series_columns = [f'd_{i}' for i in days]
time_series_data = sales_train_val[time_series_columns]
forecast = pd.DataFrame(time_series_data.iloc[:, -28:].mean(axis=1))
forecast = pd.concat([forecast] * 28, axis=1)
forecast.columns = [f'F{i}' for i in range(1, forecast.shape[1] + 1)]
validation_ids = sales_train_val['id'].values
evaluation_ids = [i.replace('validation', 'evaluation') for i in validation_ids]
ids = np.concatenate([validation_ids, evaluation_ids])
predictions = pd.DataFrame(ids, columns=['id'])
forecast = pd.concat([forecast] * 2).reset_index(drop=True)
predictions = pd.concat([predictions, forecast], axis=1)
predictions.to_csv('submission.csv', index=False)

Loss for each model #

error = [error_naive, error_avg, error_holt, error_exponential, error_arima, error_prophet]
names = ["Naive approach", "Moving average", "Holt linear", "Exponential smoothing", "ARIMA", "Prophet"]
df = pd.DataFrame(np.transpose([error, names]))
df.columns = ["RMSE Loss", "Model"]
px.bar(df, y="RMSE Loss", x="Model", color="Model", title="RMSE Loss vs. Model")

From the above graph, we can see that the two smoothing methods: moving average and exponential smoothing are the best-scoring models. Holt linear is not far behind. The remaining models: naive approach, ARIMA, and Prophet are the worst-scoring models. I believe that the accuracy of ARIMA and Prophet can be boosted significantly by tuning the hyperparameters.

Takeaways#

  • Different states have different mean and variance of sales, indicating differences in the distribution of development in these states.

  • Most sales have a linearly trended sine wave shape, reminiscent of the macroeconomic business cycle.

  • Several non-ML models can be used to forecast time series data. Moving average and exponential smoothing are very good models.

  • ARIMA and Prophet’s performance can be boosted with more hyperparamter tuning.

Resources#

This demo was adapted from the following kaggle notebook.